{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "80c6552d",
   "metadata": {},
   "source": [
    "\n",
    "# Advanced Biostatistical Techniques: PCA, Feature Selection, and SEM\n",
    "\n",
    "This notebook explores advanced biostatistical techniques, including Principal Component Analysis (PCA), random forest-based feature selection, and Structural Equation Modeling (SEM).\n",
    "\n",
    "## Dataset Information\n",
    "The Mice Protein Expression dataset is used, which can be accessed from:\n",
    "- [UCI Machine Learning Repository](https://archive.ics.uci.edu/dataset/342/mice+protein+expression)\n",
    "- [Kaggle](https://www.kaggle.com/datasets/ruslankl/mice-protein-expression)\n",
    "\n",
    "### Prerequisites\n",
    "Install the following packages before proceeding:\n",
    "```bash\n",
    "pip install pandas numpy scikit-learn matplotlib seaborn semopy graphviz\n",
    "```\n",
    "        "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "435937ee",
   "metadata": {},
   "source": [
    "\n",
    "## Data Loading and Preprocessing\n",
    "\n",
    "The dataset contains protein expression levels and associated biological class labels. We'll preprocess the data by imputing missing values and encoding categorical variables.\n",
    "        "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b1d697ab",
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "import pandas as pd\n",
    "from sklearn.impute import SimpleImputer\n",
    "from sklearn.preprocessing import LabelEncoder\n",
    "from sklearn.decomposition import PCA\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "# Load dataset (replace with dataset file path)\n",
    "from ucimlrepo import fetch_ucirepo\n",
    "mice_protein_expression = fetch_ucirepo(id=342)\n",
    "\n",
    "# Extract features and targets\n",
    "X = mice_protein_expression.data.features.iloc[:, :-3]\n",
    "y = mice_protein_expression.data.targets\n",
    "\n",
    "# Impute missing values\n",
    "imputer = SimpleImputer(strategy=\"mean\")\n",
    "X_imputed = imputer.fit_transform(X)\n",
    "\n",
    "# Encode target variable\n",
    "label_encoder = LabelEncoder()\n",
    "y_encoded = label_encoder.fit_transform(y)\n",
    "\n",
    "print(f\"Dataset Shape: {X_imputed.shape}\")\n",
    "print(f\"Classes: {label_encoder.classes_}\")\n",
    "        "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d6bfc388",
   "metadata": {},
   "source": [
    "\n",
    "## Principal Component Analysis (PCA)\n",
    "\n",
    "PCA reduces dimensionality by transforming data into principal components that explain the most variance.\n",
    "        "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a6eadf92",
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "# Perform PCA\n",
    "pca = PCA(n_components=2)\n",
    "X_pca = pca.fit_transform(X_imputed)\n",
    "\n",
    "# Create a DataFrame for PCA results\n",
    "pca_df = pd.DataFrame(X_pca, columns=[\"PC1\", \"PC2\"])\n",
    "pca_df[\"Target\"] = y_encoded\n",
    "\n",
    "# Plot PCA results\n",
    "plt.figure(figsize=(10, 8))\n",
    "categories = label_encoder.classes_\n",
    "colors = plt.cm.viridis(range(len(categories)))\n",
    "\n",
    "for i, category in enumerate(categories):\n",
    "    subset = pca_df[pca_df[\"Target\"] == i]\n",
    "    plt.scatter(subset[\"PC1\"], subset[\"PC2\"], label=category, color=colors[i])\n",
    "\n",
    "plt.xlabel(\"Principal Component 1\")\n",
    "plt.ylabel(\"Principal Component 2\")\n",
    "plt.title(\"PCA of Mice Protein Expression Dataset\")\n",
    "plt.legend(title=\"Category\")\n",
    "plt.show()\n",
    "        "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "90820e2f",
   "metadata": {},
   "source": [
    "\n",
    "## Random Forest for Feature Selection\n",
    "\n",
    "Random forest identifies the most important features contributing to classification, which helps simplify complex datasets.\n",
    "        "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c5317451",
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "from sklearn.ensemble import RandomForestClassifier\n",
    "\n",
    "# Train random forest\n",
    "rf = RandomForestClassifier(n_estimators=100, random_state=42)\n",
    "rf.fit(X_imputed, y_encoded)\n",
    "\n",
    "# Get top 10 features\n",
    "feature_importances = pd.Series(rf.feature_importances_, index=X.columns)\n",
    "top_10_features = feature_importances.nlargest(10).index\n",
    "\n",
    "# Filter data to include top 10 features\n",
    "X_top_10 = pd.DataFrame(X_imputed, columns=X.columns)[top_10_features]\n",
    "print(\"Top 10 Features:\")\n",
    "print(top_10_features)\n",
    "        "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "de5b8370",
   "metadata": {},
   "source": [
    "\n",
    "## Structural Equation Modeling (SEM)\n",
    "\n",
    "SEM models relationships among latent variables. Here, we analyze latent variables related to cell division and apoptosis signaling pathways.\n",
    "        "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "76f99820",
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "from semopy import Model, calc_stats, semplot\n",
    "\n",
    "# Define SEM model\n",
    "model_desc = '''\n",
    "Cell_division_signaling =~ pAKT_N + pBRAF_N + pCREB_N\n",
    "Apoptosis_signaling =~ MTOR_N + P38_N\n",
    "Cell_division_signaling ~ Apoptosis_signaling\n",
    "'''\n",
    "\n",
    "# Fit SEM model\n",
    "model = Model(model_desc)\n",
    "model.fit(pd.DataFrame(X_top_10, columns=top_10_features))\n",
    "\n",
    "# Calculate fit statistics\n",
    "fit_stats = calc_stats(model)\n",
    "print(f\"CFI: {fit_stats['CFI']}\")\n",
    "        "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "79e50d19",
   "metadata": {},
   "source": [
    "\n",
    "## Visualizing SEM\n",
    "\n",
    "The SEM plot shows relationships between latent variables and observed features.\n",
    "        "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "46ecb2b1",
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "semplot(model, \"sem_diagram.png\", show=\"estimates\")\n",
    "plt.imshow(plt.imread(\"sem_diagram.png\"))\n",
    "plt.axis(\"off\")\n",
    "plt.title(\"SEM Path Diagram\")\n",
    "plt.show()\n",
    "        "
   ]
  }
 ],
 "metadata": {},
 "nbformat": 4,
 "nbformat_minor": 5
}
